Background Of The Invention

A number of approaches have been used in the past 
for applying the analytic power of mass spectrometry to 
peptides. Tandem mass spectrometry (MS/MS) techniques have 
been particularly useful. In tandem mass spectrometry, the 
peptide or other input (commonly obtained from a 
chromatography device) is applied to a first mass spectrometer 
which serves to select, from a mixture of peptides, a target 
peptide of a particular mass. The target peptide is then 
activated or fragmented to produce a mixture of the target
or parent peptide and various component fragments, typically
peptides of smaller mass. This mixture is then transmitted to 
a second mass spectrometer which records a fragment spectrum. 
This fragment spectrum will typically be expressed in the form 
of a bar graph having a plurality of peaks, each peak
indicating the mass-to-charge ratio (m/z) of a detected
fragment and having an intensity value.

Although the bare fragment spectrum can be of some 
interest, it is often desired to use the fragment spectrum to 
identify the peptide (or the parent protein) which resulted in 
the fragment mixture. Previous approaches have typically 
involved using the fragment spectrum as a basis for 
hypothesizing one or more candidate amino acid sequences. 
This procedure has typically involved human analysis by a 
skilled researcher, although at least one automated procedure 
has been described. John Yates, III, et al., "Computer Aided
Interpretation of Low Energy MS/MS Mass Spectra of Peptides,"
Techniques In Protein Chemistry II (1991), pp. 477-485,
incorporated herein by reference. The candidate sequences can
then be compared with known amino acid sequences of various 
proteins in the protein sequence libraries. 

The procedure which involves hypothesizing 
candidate amino acid sequences based on fragment spectra is 
useful in a number of contexts but also has certain 
difficulties. Interpretation of the fragment spectra so as to
produce candidate amino acid sequences is time-consuming, 
often inaccurate, highly technical and in general can be 
performed only by a few laboratories with extensive experience 
in tandem mass spectrometry. Reliance on human interpretation
often means that analysis is relatively slow and lacks strict 
objectivity. Approaches based on peptide mass mapping are 
limited to peptide masses derived from an intact homogenous 
protein generated by specific and known proteolytic cleavage 
and thus are not generally applicable to mixtures of proteins. 

Accordingly, it would be useful to provide a system 
for correlating fragment spectra with known protein sequences 
while avoiding the delay and/or subjectivity involved in 
hypothesizing or deducing candidate amino acid sequences from 
the fragment spectra.

Summary Of The Invention
According to the present invention, known amino 
acid sequences, e.g., in a protein sequence library, are used 
to calculate or predict one or more candidate fragment 
spectra. The predicted fragment spectra are then compared 
with an experimentally-derived fragment spectrum to determine 
the best match or matches. Preferably, the parent peptide
from which the fragment spectrum was derived has a known mass.
Sub-sequences of the various sequences in the protein
sequence library are analyzed to identify those sub-sequences 
corresponding to a peptide whose mass is equal to (or within a 
given tolerance of) the mass of the parent peptide in the 
fragment spectrum. For each sub-sequence having the proper 
mass, a predicted fragment spectrum can be calculated, e.g., 
by calculating masses of various amino acid subsets of the 
candidate peptide. The result will be a plurality of 
candidate peptides, each with a predicted fragment spectrum. 
The predicted fragment spectra can then be compared with the 
fragment spectrum derived from the tandem mass spectrometer to 
identify one or more proteins having sub-sequences which are
likely to be identical with the sequence of the peptide which 
resulted in the experimentally-derived fragment spectrum. 

Fig. 1 is a block diagram depicting previous
methods for correlating tandem mass spectrometer data with
sequences from a protein sequence library;

Fig. 2 is a block diagram showing a method for
correlating tandem mass spectrometer data with
a protein sequence library according to an embodiment of the
present invention;

Fig. 3 is a flow chart showing steps for
correlating tandem mass spectrometry data with amino acid
sequences, according to an embodiment of the present
invention;

Fig. 4 is a flow diagram showing details of a
method for the step of identifying candidate sub-sequences of
Fig. 3;

Fig. 5 is a fragment mass spectrum for a peptide of
a type that can be used in connection with the present
invention; and

Figs. 6A-6D are flow charts showing an analysis
method, according to an embodiment of the present invention.

Description Of The Specific Embodiments
Before describing the embodiments of the present 
invention, it will be useful to describe, in greater detail, a
previous method. As depicted in Fig. 1, the previous method
is used for analysis of an unknown peptide 12. Typically the
peptide will be output from a chromatography column which has 
been used to separate a partially fractionated protein. The 
protein can be fractionated by, for example, gel filtration 
chromatography and/or high performance liquid chromatography 
(HPLC). The sample 12 is introduced to a tandem mass
spectrometer 14 through an ionization method such as
electrospray ionization (ES). In the first mass spectrometer,
a peptide ion is selected, so that a targeted component of a
specific mass is separated from the rest of the sample 14a.
The targeted component is then activated or decomposed. In 
the case of a peptide, the result will be a mixture of the 
ionized parent peptide ("precursor ion") and component 
peptides of lower mass which are ionized to various states. A 
number of activation methods can be used including collisions 
with neutral gases (also referred to as collision-induced
dissociation). The parent peptide and its fragments are then
provided to the second mass spectrometer 14c, which outputs an 
intensity and m/z for each of the plurality of fragments in 
the fragment mixture. This information can be output as a 
fragment mass spectrum 16. Fig. 5 provides an example of such 
a spectrum 16. In the spectrum 16 each fragment ion is
represented as a bar whose abscissa value indicates the
mass-to-charge ratio (m/z) and whose ordinate value represents
intensity. According to previous methods, in order to
correlate a fragment spectrum with sequences from a protein
sequence library, the fragment spectrum was converted into one
or more amino acid sequences judged to correspond to the
fragment spectrum. In one strategy, the mass of each of the
amino acids is subtracted from the molecular weight of the
parent ion to determine what might be the molecular weight of
a fragment assuming, respectively, each amino acid is in the
terminal position. It is determined if this fragment mass is
found in the actual measured fragment spectrum. Scores are
generated for each of the amino acids and the scores are
sorted to generate a list of partial sequences for the next
subtraction cycle. Cycles continue until subtraction of the
mass of an amino acid leaves a difference of less than 0.5 and
greater than -0.5. The result is one or more candidate amino
acid sequences 18. This procedure can be automated as
described, for example, in Yates III (1991), supra. One or
more of the highest-scoring candidate sequences can then be
compared 21 to sequences in a protein sequence library 20 to
try to identify a protein having a sub-sequence similar or
identical to the sequence believed to correspond to the
peptide which generated the fragment spectrum 16.

Fig. 2 shows an overview of a process according to
the present invention. According to the process of Fig. 2, a
fragment spectrum 16 is obtained in a manner similar to that
described above for the fragment spectrum depicted in Fig. 1.
Specifically, the sample 12 is provided to a tandem mass
spectrometer 14. Procedures described below use a two-step
process to acquire ms/ms data. However, the present invention
can also be used in connection with mass spectrometry
approaches currently being developed which incorporate
acquisition of ms/ms data with a single step. In one
embodiment ms/ms spectra would be acquired at each mass. The
first ms would separate the ions by mass-to-charge and the
second would record the ms/ms spectrum. The second stage of
ms/ms would acquire, e.g., 5 to 10 spectra at each mass
transformed by the first ms.

A number of mass spectrometers can be used 
including a triple-quadrupole mass spectrometer, a Fourier-
transform cyclotron resonance mass spectrometer, a tandem
time-of-flight mass spectrometer and a quadrupole ion trap
mass spectrometer. In the process of Fig. 2, however, it is
not necessary to use the fragment spectrum as a basis for
hypothesizing one or more amino acid sequences. In the
process of Fig. 2, sub-sequences contained in the protein
sequence library 20 are used as a basis for predicting a
plurality of mass spectra 22, e.g., using prediction 
techniques described more fully below. 

A number of sequence libraries can be used,
including, for example, the Genpept database, the GenBank
database (described in Burks, et al., "GenBank: Current status
and future directions," Methods in Enzymology, 183:3
(1990)), the EMBL data library (described in Kahn, et al., "EMBL
Data Library," Methods in Enzymology, 183:23 (1990)), the
Protein Sequence Database (described in Barker, et al.,
"Protein Sequence Database," Methods in Enzymology, 183:31
(1990)), SWISS-PROT (described in Bairoch, et al., "The SWISS-
PROT protein sequence data bank, recent developments," Nucleic
Acids Res., 21:3093-3096 (1993)), and PIR-International
(described in "Index of the Protein Sequence Database of the
International Association of Protein Sequence Databanks (PIR-
International)," Protein Seq. Data Anal., 5:67-192 (1993)).

The predicted mass spectra 22 are compared 24 to 
the experimentally-derived fragment spectrum 16 to identify 
one or more of the predicted mass spectra which most closely 
match the experimentally-derived fragment spectrum 16. 
Preferably the comparison is done automatically by calculating
a closeness-of-fit measure for each of the plurality of
predicted mass spectra 22 (compared to the experimentally-
derived fragment spectrum 16). It is believed that, in
general, there is a high probability that the peptide analyzed
by the tandem mass spectrometer has an amino acid sequence
identical to one of the sub-sequences taken from the protein
sequence library 20 which resulted in a predicted mass
spectrum 22 exhibiting a high closeness-of-fit with respect to
the experimentally-derived fragment spectrum 16. Furthermore,
when the peptide analyzed by the tandem mass spectrometer 14
was derived from a protein, it is believed there is a high
probability that the parent protein is identical or similar to
the protein whose sequence in the protein sequence library 20
includes a sub-sequence that resulted in a predicted mass
spectrum 22 which had a high closeness-of-fit with respect to
the fragment spectrum 16. Preferably, the entire procedure
can be performed automatically using, e.g., a computer to
calculate predicted mass spectra 22 and/or to perform
comparison 24 of the predicted mass spectra 22 with the
experimentally-derived fragment spectrum 16.

Fig. 3 is a flow diagram showing one method for
predicting mass spectra 22 and performing the comparison 24.
According to the method of Fig. 3, the experimentally-derived
fragment spectrum 16 is first normalized 32. According to one
normalization method, the experimentally-derived fragment
spectrum 16 is converted to a list of masses and intensities.
The values for the precursor ion are removed from the file.
The square root of all the intensity values is calculated and
normalized to a maximum intensity of 100. The 200 most
intense ions are divided into ten mass regions and the maximum
intensity is normalized to 100 within each region. Each ion
which is within 3.0 daltons of its neighbor on either side is
given the greater intensity value, if a neighboring intensity
is greater than its own intensity. Of course, other
normalizing methods can be used and it is possible to perform
analysis without performing normalization, although
normalization is, in general, preferred. For example, it is
possible to use maximum intensities with a value greater than
or less than 100. It is possible to select more or fewer than
the 200 most intense ions. It is possible to divide into more
or fewer than 10 mass regions. It is possible to make the
window for assuming the neighboring intensity value to be
greater than or less than 3.0 daltons.
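
The normalization steps just described can be sketched in Python as follows. This is only an illustrative sketch: the function name, the width of the window used to remove the precursor ion, the use of equal-width mass regions, and the reading of "neighbor" as any retained ion within the window are assumptions, not details given in the text.

```python
import math

def normalize_spectrum(peaks, precursor_mz, precursor_window=1.0,
                       top_n=200, regions=10, neighbor_da=3.0):
    """Return processed (m/z, intensity) pairs sorted by m/z.

    peaks is a list of (m/z, intensity) tuples. precursor_window is an
    assumed half-width used to drop the precursor ion.
    """
    # Step 1: drop the precursor ion and any non-positive intensities.
    kept = [(mz, inten) for mz, inten in peaks
            if inten > 0 and abs(mz - precursor_mz) > precursor_window]
    if not kept:
        return []
    # Step 2: take square roots, then scale so the largest value is 100.
    rooted = [(mz, math.sqrt(inten)) for mz, inten in kept]
    top = max(inten for _, inten in rooted)
    scaled = [(mz, 100.0 * inten / top) for mz, inten in rooted]
    # Step 3: keep the top_n most intense ions, ordered by m/z.
    by_intensity = sorted(scaled, key=lambda p: p[1], reverse=True)
    strongest = sorted(by_intensity[:top_n])
    # Step 4: split the m/z range into equal-width regions (an assumption)
    # and rescale each region so its maximum is 100.
    lo, hi = strongest[0][0], strongest[-1][0]
    width = (hi - lo) / regions or 1.0
    buckets = {}
    for mz, inten in strongest:
        idx = min(int((mz - lo) / width), regions - 1)
        buckets.setdefault(idx, []).append((mz, inten))
    rescaled = []
    for idx in sorted(buckets):
        region_max = max(inten for _, inten in buckets[idx])
        rescaled.extend((mz, 100.0 * inten / region_max)
                        for mz, inten in buckets[idx])
    # Step 5: each ion takes the greatest intensity found within
    # neighbor_da of it (its own intensity if no neighbor is larger).
    return [(mz, max(j for m2, j in rescaled if abs(m2 - mz) <= neighbor_da))
            for mz, _ in rescaled]
```

The defaults (200 ions, 10 regions, 3.0 daltons) mirror the values quoted above and can be changed, as the text notes.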

In order to generate predicted mass spectra from a 
protein sequence library, according to the process of Fig. 3, 
sub-sequences within each protein sequence are identified 
which have a mass which is within a tolerance amount of the 
mass of the unknown peptide. As noted above, the mass of the 
unknown peptide is known from the tandem mass spectrometer 34. 
Identification of candidate sub-sequences 34 is shown in
greater detail in Fig. 4. In general, the process of 
identifying candidate sub-sequences involves summing the 
masses of linear amino acid sequences until the sum is either 
within a tolerance of the mass of the unknown peptide (the 
"target" mass) or has exceeded the target mass (plus 
tolerance). If the mass of the sequence is within tolerance
of the target mass, the sequence is marked as a candidate. If 
the mass of the linear sequence exceeds the mass of the 
unknown peptide, then the algorithm is repeated, beginning 
with the next amino acid position in the sequence. 

According to the method of Fig. 4, a variable m, 
indicating the starting amino acid along the sequence is 
initialized to 0 and incremented by 1 (36, 38). The sum,
representing the cumulative mass and a variable n representing 
the number of amino acids thus far calculated in the sum, are 
initially set to 0 (40) and variable n is incremented 42. The 
molecular weight of a peptide corresponding to a sub-sequence 
of a protein sequence is calculated in iterative fashion by 
steps 44 and 46. In each iteration, the sum is incremented by
the molecular weight of the next (nth) amino
acid in the sequence 44. Values of the sum 44 may be stored
for use in calculating fragment masses for use in predicting a 
fragment mass spectrum as described below. If the resulting 
sum is less than the target mass decremented by a tolerance 
46, the value of n is incremented 42 and the process is 
repeated 44. A number of tolerance values can be used. In 
one embodiment, a tolerance value of ±0.05% of the mass of the 
unknown peptide was used. If the new sum is no longer less
than a tolerance amount below the target mass, it is then
determined if the new sum is greater than the target mass plus
the tolerance amount. If the new sum is more than the
tolerance amount in excess of the target mass, this particular
sequence is not considered a candidate sequence and the
process begins again, starting from a new starting point in
the sequence (by incrementing the starting point value m
(38)). If, however, the sum is not greater than the target
mass plus the tolerance amount, it is known that the sum is
within one tolerance amount of the target mass and, thus, that
the sub-sequence beginning with the mth amino acid and
extending to the (m + n)th amino acid of the sequence is a
candidate sequence.
The candidate sequence is marked, e.g., by storing the values
of m and n to define this sub-sequence.
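
The search of Fig. 4 can be sketched in Python as shown below. The sketch makes several assumptions that the text does not state: it uses monoisotopic residue masses, adds the mass of one water molecule to the residue sum to obtain the peptide mass, uses 0-based start positions, and expects only the 20 standard one-letter residue codes. The names `RESIDUE_MASS` and `find_candidates` are hypothetical.

```python
# Monoisotopic residue masses in daltons (assumed; the text does not say
# whether average or monoisotopic masses were used).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056

def find_candidates(protein, target_mass, tolerance_fraction=0.0005):
    """Return (start, length) pairs for sub-sequences within tolerance.

    tolerance_fraction=0.0005 corresponds to the 0.05% value quoted above.
    """
    tol = target_mass * tolerance_fraction
    candidates = []
    for m in range(len(protein)):          # starting position
        total = WATER                      # assumed: peptide = residues + H2O
        for n in range(m, len(protein)):   # extend one residue at a time
            total += RESIDUE_MASS[protein[n]]
            if total < target_mass - tol:
                continue                   # still too light, keep extending
            if total <= target_mass + tol:
                candidates.append((m, n - m + 1))
            break                          # matched or overshot: next start
    return candidates
```

Storing the running totals, as the text suggests, would let the later fragment-mass calculation reuse them instead of summing again.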

Returning to Fig. 3, once a plurality of candidate 
sub-sequences have been identified, a fragment mass spectrum 
is predicted for each of the candidate sequences 52. The 
fragment mass spectrum is predicted by calculating the 
fragment ion masses for the type b- and y- ions for the amino 
acid sequence. When a peptide is fragmented and the charge is
retained on the N-terminal cleavage fragment, the resulting
ion is labelled as a b-type ion. If the charge is retained on
the C-terminal fragment, it is labelled a y-type ion.
Masses for type b- ions were calculated by summing the amino
acid masses and adding the mass of a proton. Type y- ions
were calculated by summing, from the C-terminus, the masses of
the amino acids and adding the mass of water and a proton to
the initial amino acid. In this way, it is possible to
calculate an m/z for each fragment. However, in order to
provide a predicted mass spectrum, it is also necessary to
assign an intensity value for each fragment. It might be
possible to predict, on a theoretical basis, an intensity value
for each fragment. However, this procedure is difficult. It
has been found useful to assign intensities in the following
fashion. The value of 50.0 is assigned to each b and y ion.
To masses of 1 dalton on either side of the fragment ion, an
intensity of 25.0 is assigned. Peak intensities of 10.0 are
assigned at 17.0 and 18.0 daltons below the m/z of each b- and
y-ion location (for both NH3 and H2O loss), and peak
intensities of 10.0 are assigned at 28.0 amu below each type
b-ion location (for type a-ions).
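
The b- and y-ion calculation and the intensity assignment can be illustrated with the following Python sketch. It assumes singly charged fragments, monoisotopic masses, and rounding of predicted peaks to unit masses; where two predicted peaks fall on the same unit mass, keeping the larger intensity is also an assumption. All names are hypothetical.

```python
# Monoisotopic residue masses in daltons (an assumption of this sketch).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728

def b_ions(peptide):
    """m/z of b1..b(n-1): N-terminal residue sums plus a proton."""
    masses, total = [], PROTON
    for residue in peptide[:-1]:
        total += RESIDUE_MASS[residue]
        masses.append(total)
    return masses

def y_ions(peptide):
    """m/z of y1..y(n-1): C-terminal residue sums plus water and a proton."""
    masses, total = [], WATER + PROTON
    for residue in reversed(peptide[1:]):
        total += RESIDUE_MASS[residue]
        masses.append(total)
    return masses

def predict_spectrum(peptide):
    """Return {unit m/z: intensity} using the fixed intensities above."""
    spectrum = {}

    def put(mz, intensity):
        key = round(mz)
        spectrum[key] = max(spectrum.get(key, 0.0), intensity)

    for series, is_b in ((b_ions(peptide), True), (y_ions(peptide), False)):
        for mz in series:
            put(mz, 50.0)              # the sequence ion itself
            put(mz - 1.0, 25.0)        # 1 dalton on either side
            put(mz + 1.0, 25.0)
            put(mz - 17.0, 10.0)       # NH3 loss
            put(mz - 18.0, 10.0)       # H2O loss
            if is_b:
                put(mz - 28.0, 10.0)   # a-ion, 28 below the b-ion
    return spectrum
```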

Returning to Fig. 3, after calculation of predicted 
m/z values and assignment of intensities, it is preferred to 
calculate a measure of closeness-of-fit between the predicted
mass spectra 22 and the experimentally-derived fragment
spectrum 16. A number of methods for calculating closeness-
of-fit are available. In the embodiment depicted in Fig. 3, a
two-step method is used 54. The two-step method includes
calculating a preliminary closeness-of-fit score, referred to
here as S_p 56 and, for the highest-scoring amino acid
sequences, calculating a correlation function 58. According
to one embodiment, S_p is calculated using the following
formula:

$$S_p = \left(\sum i_m\right) \cdot n_i \cdot (1 + \beta) \cdot (1 - \rho) / n_t$$

where i_m = matched intensities, n_i = number of matched
fragment ions, β = type b- and y-ion continuity, ρ = presence
of immonium ions and their respective amino acids in the
predicted sequence, and n_t = total number of fragment ions. The
factor β evaluates the continuity of a fragment ion series.
If there was a fragment ion match for the ion immediately
preceding the current type b- or y-ion, β is incremented by
0.075 (from an initial value of 0.0). This increases the
preliminary score for those peptides matching a successive
series of type b- and y-ions since extended series of ions of
the same type are often observed in MS/MS spectra. The factor
ρ evaluates the presence of immonium ions in the low mass end
of the mass spectrum. Immonium ions are diagnostic for the
presence of some types of amino acids in the sequence. If
immonium ions are present at 110.0, 120.0, or 136.0 Da (± 1.0
Da) in the processed data file of the unknown peptide with
normalized intensities greater than 40.0, indicating the
presence of histidine, phenylalanine, and tyrosine
respectively, then the sequence under evaluation is checked
for the presence of the amino acid indicated by the immonium
ion. The preliminary score, S_p, for the peptide is either
augmented or depreciated by a factor of (1 - ρ) where ρ is the
sum of the penalties for each of the three amino acids whose
presence is indicated in the low mass region. Each individual
ρ can take on the value of -0.15 if there is a corresponding
low mass peak and the amino acid is not present in the
sequence, +0.15 if there is a corresponding low mass peak and
the amino acid is present in the sequence, or 0.0 if the low
mass peak is not present. The total penalty can range from
-0.45 (all three low mass peaks present in the spectrum yet
none of the three amino acids are in the sequence) to +0.45
(all three low mass peaks are present in the spectrum and all
three amino acids are in the sequence).
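
A minimal Python sketch of the preliminary score is given below. It takes the factor (1 - ρ) and the ±0.15 values literally from the text; if the intended convention is that a confirmed immonium ion should raise the score, the sign of that term would need to be flipped. It also assumes that β is incremented only when both the current ion and the immediately preceding ion of the same series are matched. The function names and the input layout (one matched intensity per predicted ion, 0.0 when unmatched) are hypothetical.

```python
# Immonium ion positions named in the text (Da) and the residues they indicate.
IMMONIUM = {"H": 110.0, "F": 120.0, "Y": 136.0}

def immonium_factor(spectrum, peptide, min_intensity=40.0, window=1.0):
    """Sum the per-residue values (+0.15, -0.15 or 0.0) to obtain rho.

    spectrum is a list of (m/z, normalized intensity) pairs.
    """
    rho = 0.0
    for residue, mz in IMMONIUM.items():
        peak_present = any(abs(m - mz) <= window and inten > min_intensity
                           for m, inten in spectrum)
        if peak_present:
            rho += 0.15 if residue in peptide else -0.15
    return rho

def continuity(matched):
    """0.075 for every matched ion whose predecessor was also matched."""
    return 0.075 * sum(1 for prev, cur in zip(matched, matched[1:])
                       if prev and cur)

def preliminary_score(b_hits, y_hits, rho):
    """Compute S_p from matched intensities listed in series order.

    b_hits and y_hits hold one value per predicted ion: the matched
    intensity, or 0.0 when the ion was not found in the spectrum.
    """
    hits = list(b_hits) + list(y_hits)
    n_t = len(hits)                          # total predicted fragment ions
    if n_t == 0:
        return 0.0
    sum_im = sum(hits)                       # summed matched intensities
    n_i = sum(1 for h in hits if h > 0)      # number of matched ions
    beta = (continuity([h > 0 for h in b_hits]) +
            continuity([h > 0 for h in y_hits]))
    return sum_im * n_i * (1 + beta) * (1 - rho) / n_t
```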

Following the calculation of the preliminary 
closeness-of-fit score S_p, those candidate predicted mass
spectra having the highest S_p scores are selected for further
analysis using the correlation function 58. The number of
candidate predicted mass spectra which are selected for
further analysis will depend largely on the computational
resources and time available. In one embodiment, 300
candidate peptide sequences with the highest preliminary score
were selected.

For purposes of calculating the correlation 
function 58, the experimentally-derived fragment spectrum is
preprocessed in a fashion somewhat different from the
preprocessing 32 employed before calculating S_p. For purposes
of the correlation function, the precursor ion was removed
from the spectrum and the spectrum divided into 10 sections.
Ions in each section were then normalized to 50.0. The
sectionwise normalized spectra 60 were then used for
calculating the correlation function. According to one
embodiment, the discrete correlation between the two functions
is calculated as:

$$R(t) = \sum_{i=0}^{n-1} x(i + t)\, y(i)$$

where t is a lag value. The discrete correlation theorem
states that the discrete correlation of two real functions x
and y is one member of the discrete Fourier transform pair

$$R(t) \Longleftrightarrow X(t)\, Y^{*}(t) \qquad (3)$$

where X(t) and Y(t) are the discrete Fourier transforms of
x(i) and y(i) and the Y* denotes complex conjugation.
Therefore, the cross-correlations can be computed by Fourier 
transformation of the two data sets using the fast Fourier 
transform (FFT) algorithm, multiplication of one transform by 
the complex conjugate of the other, and inverse transformation 
of the resulting product. In one embodiment, all of the 
predicted spectra as well as the pre-processed unknown 
spectrum were zero-padded to 4096 data points since the MS/MS
spectra are not periodic (as intended by the correlation 
theorem) and the FFT algorithm requires N to be an integer 
power of two, so the resulting end effects need to be 
considered. The final score attributed to each candidate 
peptide sequence is R(0) minus the mean of the 
cross-correlation function over the range -75<t<75. This
modified "correlation parameter" described in Powell and
Hieftje, Anal. Chim. Acta, Vol. 100, pp. 313-327 (1978) shows
better discrimination over just the spectral correlation
coefficient R(0). The raw scores are normalized to 1.0. In
one embodiment, output 62 includes the normalized raw score,
the candidate peptide mass, the unnormalized correlation
coefficient, the preliminary score, the fragment ion
continuity β, the immonium ion factor ρ, the number of type b-
and y-ions matched out of the total number of fragment ions,
their matched intensities, the protein accession number, and
the candidate peptide sequence.
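
The FFT-based cross-correlation and the final score can be sketched in Python as follows. To stay self-contained the sketch includes a small recursive radix-2 FFT written for illustration; a library FFT would normally be used instead. It assumes both spectra are lists of intensities indexed by unit mass, no longer than the padded size, and that the lag convention is R(t) = sum of x(i + t) y(i). The function names are hypothetical.

```python
import cmath

def fft(values, inverse=False):
    """Unscaled recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1.0 if inverse else -1.0
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def cross_correlation(x, y, size=4096):
    """Return R where R[t] = sum_i x[i + t] * y[i] (negative t wraps)."""
    xp = list(x) + [0.0] * (size - len(x))   # zero-pad both inputs
    yp = list(y) + [0.0] * (size - len(y))
    fx, fy = fft(xp), fft(yp)
    # Multiply one transform by the complex conjugate of the other.
    product = [a * b.conjugate() for a, b in zip(fx, fy)]
    # Inverse transform; this fft is unscaled, so divide by size here.
    return [v.real / size for v in fft(product, inverse=True)]

def correlation_score(x, y, size=4096, max_lag=75):
    """R(0) minus the mean of R(t) over -max_lag < t < max_lag."""
    r = cross_correlation(x, y, size)
    lags = range(-max_lag + 1, max_lag)
    mean = sum(r[t] for t in lags) / len(lags)
    return r[0] - mean
```

Subtracting the mean over the lag window lowers the score of a candidate whose overlap with the measured spectrum is no better at zero lag than at nearby lags.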

If desired, the correlation function 58 can be used
to automatically select one of the predicted mass spectra 22
as corresponding to the experimentally-derived fragment
spectrum 16. Preferably, however, a number of sequences from
the library 20 are output and final selection of a single
sequence is done by a skilled operator.

In addition to predicting mass spectra from protein 
sequence libraries, the present invention also includes 
predicting mass spectra based on nucleotide databases. The 
procedure involves the same algorithmic approach of cycling 
through the nucleotide sequence. The 3-base codons will be 
converted to a protein sequence and the mass of the amino 
acids summed in a fashion similar to the summing depicted in 
Fig. 4. To cycle through the nucleotide sequence, a 1-base 
increment will be used for each cycle. This will allow the 
determination of an amino acid sequence for each of the three 
reading frames in one pass. The scoring and reporting 
procedures for the search can be the same as that described 
above for the protein sequence database. 
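
The one-pass, 1-base-increment translation can be sketched in Python as below. The sketch assumes the standard genetic code, a DNA-alphabet forward-strand sequence, and the use of "X" for any codon containing an unrecognized base; these choices and the names used are illustrative only.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def translate_frames(dna):
    """Translate all three forward reading frames in a single pass.

    The window advances one base per cycle; the codon starting at
    position pos belongs to reading frame pos % 3. Stop codons are
    reported as "*".
    """
    dna = dna.upper()
    frames = ["", "", ""]
    for pos in range(len(dna) - 2):
        frames[pos % 3] += CODON_TABLE.get(dna[pos:pos + 3], "X")
    return frames
```

Each of the three resulting amino acid strings could then be scanned for mass-matched sub-sequences in the same way as a protein library entry.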

Depending on the computing and time resources 
available, it may be advantageous to employ data-reduction 
techniques. Preferably these techniques will emphasize the 
most informative ions in the spectrum while not unduly 
affecting search speed. One technique involves considering 
only some of the fragment ions in the MS/MS spectrum. A 
spectrum for a peptide may contain as many as 3,000 fragment 
ions. According to one data reduction strategy, the ions are 
ranked by intensity and some fraction of the most intense ions 
(e.g., the top 200 most intense ions) will be used for 
comparison. Another approach involves subdividing the 
spectrum into, e.g., 4 or 5 regions and using the 50 most 
intense ions in each region as part of the data set. Yet 
another approach involves selecting ions based on the 
probability of those ions being sequence ions. For example, 
ions could be selected which exist in mass windows of 57 
through 186 daltons (range of mass increments for the 20 
common amino acids from GLY to TRP) that contain diagnostic 
features of type b- or y- ions, such as losses of 17 or 18
daltons (NH3 or H2O) or a loss of 28 daltons (CO).

The techniques described above are, in general,
applicable to spectra of peptides with charged states of +1 or 
+2, typically having a relatively short amino acid sequence. 
Using a longer amino acid sequence increases the probability 
of a unique match to a protein sequence. However, longer 
peptide sequences have a greater likelihood of containing more 
basic amino acids, and thus producing ions of higher charge 
state under electrospray ionization conditions. According to
one embodiment of the invention, algorithms are provided for 
searching a database with MS/MS spectra of highly charged 
peptides (+3, +4, +5, etc.). According to one approach, the 
search program will include an input for the charge state (N) 
of the precursor ion used in the MS/MS analysis. Predicted 
fragment ions will be generated for each charge state less 
than N. Thus, for a peptide of +4, the charge states of +1, +2
and +3 will be generated for each fragment ion and compared to
the MS/MS spectrum.

The second strategy for use with multiply charged
spectra is the use of mathematical deconvolution to convert
the multiply charged fragment ions to their singly charged
masses. The deconvoluted spectrum will then contain the
fragment ions for the multiply charged fragment ions and their
singly charged counterparts.
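
Both strategies rest on the same arithmetic, sketched below in Python. The sketch assumes that each charge is carried by an added proton and that the charge of a peak is already known when converting it to its singly charged mass; the function names are hypothetical.

```python
PROTON = 1.00728  # mass of a proton in daltons

def fragment_mz_series(neutral_mass, precursor_charge):
    """m/z of one fragment at every charge state below the precursor's.

    neutral_mass is the fragment mass without any charge-carrying protons.
    """
    return {z: (neutral_mass + z * PROTON) / z
            for z in range(1, precursor_charge)}

def to_singly_charged(mz, charge):
    """Convert a peak observed at a known charge to its +1 m/z value."""
    neutral = mz * charge - charge * PROTON
    return neutral + PROTON
```

For a +4 precursor, `fragment_mz_series` yields the +1, +2 and +3 values for a fragment, and `to_singly_charged` maps any of them back onto the +1 value.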

To speed up searches of the database, a directed- 
search approach can be used. In cases where experiments are 
performed on specific organisms or specific types of proteins, 
it is not necessary to search the entire database on the first 
pass. Instead, a search of protein sequences specific to a
species or a class of proteins can be performed first. If the
search does not provide reasonable answers, then the entire
database is searched.

A number of different scoring algorithms can be
used for determining preliminary closeness of fit or
correlation. In addition to scoring based on the number of
matched ions multiplied by the sum of the intensity, scoring
can be based on the percentage of continuous sequence coverage
represented by the sequence ions in the spectrum. For
example, a 10 residue peptide will potentially contain 9 each
of b- and y-type sequence ions. If a set of ions extends from
B1 to B9, then a score of 100 is awarded, but if a
discontinuity is observed in the middle of the sequence, such
as missing the B5 ion, a penalty is assessed. The maximum
score is awarded for an amino acid sequence that contains a
continuous ion series in both the b and y directions.

In the event the described scoring procedures do 
not delineate an answer, an additional technique for spectral 
comparison can be used in which the database is initially 
searched with a molecular weight value and a reduced set of 
fragment ions. Initial filtering of the database occurs by 
matching sequence ions and generating a score with one of the 
methods described above. The resulting set of answers will 
then undergo a more rigorous inspection process using a 
modified full MS/MS spectrum. For the second stage analysis,
one of several spectral matching approaches developed for
spectral library searching is used. This will require
generating a "library spectrum" for the peptide sequence based
on the sequence ions predicted for that amino acid sequence.
Intensity values for sequence ions of the "library spectrum"
will be obtained from the experimental spectrum. If a
fragment ion is predicted at m/z 256, then the intensity value
for the ion in the experimental spectrum at m/z 256 will be
used as the intensity of the ion in the predicted spectrum.
Thus, if the predicted spectrum is identical to the "unknown"
spectrum, it will represent an ideal spectrum. The spectra
will then be compared using a correlation function. In 
general, it is believed that the majority of computational 
time for the above procedure is spent in the iterative search 
process. By multiplexing the analysis of multiple MS/MS 
spectra in one pass through the database, an overall 
improvement in efficiency will be realized. In addition, the 
mass tolerance used in the initial pre-filtering can affect
search times by increasing or decreasing the number of 
sequences to analyze in subsequent steps. Another approach to 
speed up searches involves a binary encoding scheme where
the mass spectrum is encoded as peak/no peak at every mass
depending on whether the peak is above a certain threshold
value. If intensive use of a protein sequence library is
contemplated, it may be possible to calculate and store
predicted mass values of all sub-sequences within a
predetermined range of masses so that at least some of the
analysis can be performed by table look-up rather than
calculation.
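
One way to picture the peak/no-peak encoding is the Python sketch below, which packs a spectrum into a single integer with one bit per unit mass. The bit layout, the upper mass limit, and the use of a shared-peak count as the comparison are assumptions made for illustration; the text does not specify them.

```python
def encode_spectrum(peaks, threshold, max_mass=2000):
    """Encode (m/z, intensity) pairs as an integer bit mask.

    Bit m is set when a peak with intensity above threshold falls at
    unit mass m. max_mass is an assumed upper bound on encoded masses.
    """
    bits = 0
    for mz, inten in peaks:
        mass = round(mz)
        if inten > threshold and 0 <= mass <= max_mass:
            bits |= 1 << mass
    return bits

def shared_peaks(encoded_a, encoded_b):
    """Count unit masses at which both encoded spectra have a peak."""
    return bin(encoded_a & encoded_b).count("1")
```

Because the comparison reduces to a bitwise AND and a bit count, many candidate spectra can be screened cheaply before the more expensive scoring steps are applied.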

Figs. 6A-6E are flow charts showing an analysis 
procedure according to one embodiment of the present 
invention. After data is acquired from the tandem mass 
spectrometer, as described above 602, the data is saved to a 
file and converted to an ASCII format 604. At this point, a 
preprocessing procedure is started 606. The user enters 
information regarding the peptide mass and the precursor ion
charge state 608. Mass/intensity values are loaded from the
ASCII file, with the values being rounded to unit masses 610. 
The previously-identified precursor ion contribution of this 
data is removed 612. The remaining data is normalized to a 
maximum intensity of 100 614. At this point, different paths 
can be taken. In one case, the presence of any immonium ions 
(H, F and Y) is noted 616 and the peptide mass and immonium
ion information is stored in a datafile 618. In another 
route, the 200 most intense peaks are selected 620. If two 
peaks are within a predetermined distance (e.g., 2 amu) of 
each other, the lower intensity peak is set equal to a greater 
intensity 622. After this procedure, the data is stored in a 
datafile for preliminary scoring 624. In another route, the 
data is divided into a number of windows, for example ten 
windows 626. Normalization is performed within each window, 
for example, normalizing to a maximum intensity of 50 628. 
This data is then stored in a datafile for final correlation 
scoring 630. This ends the preprocessing phase, according to 

this embodiment 632. 
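
The preprocessing routes of Fig. 6A can be sketched as below. The sketch rests on several assumptions that the description does not fix: peaks that round to the same unit mass keep the larger intensity, the precursor is removed within a window of ±2 mass units, and the ten windows are of equal width in mass. All function and parameter names are hypothetical.

```python
def base_spectrum(peaks, precursor_mz, precursor_window=2):
    """Steps 610-614: unit masses, precursor removal, normalization to 100."""
    unit = {}
    for mz, intensity in peaks:
        m = int(round(mz))
        unit[m] = max(unit.get(m, 0.0), intensity)  # assumption: keep the tallest
    kept = {m: i for m, i in unit.items()
            if abs(m - precursor_mz) > precursor_window}
    top = max(kept.values(), default=0.0)
    return {m: 100.0 * i / top for m, i in kept.items()} if top > 0 else {}


def preliminary_spectrum(spec, n_peaks=200, distance=2):
    """Steps 620-622: keep the most intense peaks, then raise each peak to
    the greatest intensity found within `distance` mass units of it."""
    strongest = sorted(spec, key=spec.get, reverse=True)[:n_peaks]
    top = {m: spec[m] for m in strongest}
    return {m: max(top.get(n, 0.0) for n in range(m - distance, m + distance + 1))
            for m in top}


def windowed_spectrum(spec, n_windows=10, target=50.0):
    """Steps 626-628: split the mass range into windows and normalize each
    window to a maximum intensity of `target`."""
    if not spec:
        return {}
    lo, hi = min(spec), max(spec)
    width = (hi - lo + 1) / n_windows
    windows = {}
    for m in spec:
        w = min(int((m - lo) / width), n_windows - 1)
        windows.setdefault(w, []).append(m)
    out = {}
    for masses in windows.values():
        top = max(spec[m] for m in masses)
        for m in masses:
            out[m] = target * spec[m] / top if top > 0 else 0.0
    return out


# Illustrative peak list; the 500.0 entry plays the role of the precursor ion.
raw = [(100.1, 10.0), (150.3, 40.0), (500.0, 1000.0), (900.2, 20.0), (901.0, 5.0)]
spec = base_spectrum(raw, precursor_mz=500.0)
for_preliminary_scoring = preliminary_spectrum(spec)
for_correlation_scoring = windowed_spectrum(spec)
```

Keeping the two routes as separate functions mirrors the flow chart, where the preliminary-scoring file and the correlation-scoring file are derived independently from the same normalized data.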

The database search is started 634 and the search 
parameters and the data obtained from the preprocessing 
procedure (Fig. 6A) are loaded 636. A first batch of database 
sequences is loaded 638 and a search procedure is run on a 
particular protein 640. The search procedure is detailed in 
Fig. 6C. As long as the end of the batch has not been reached 
the index is incremented 642 and the search routine is 
repeated 640. Once it is determined that the end of a batch 
has been reached 644, as long as the end of the database has 
not been reached, the second index 646 is incremented and a 
new batch of database sequences is loaded 638. Once the end 
of the database has been reached 628, a correlation analysis 
is performed 630 (as detailed in Fig. 6E), the results are 
printed 632 and the procedure ends 634. 

When the search procedure is started 638 (Fig. 6C), 
an index I1 is set to zero 646 to indicate the start position 
of the candidate peptide within the amino acid sequence being 
searched 640. A second index I2, indicating the end position of 
the candidate peptide within the amino acid sequence being 
searched, is initially set equal to I1 and the variable Pmass, 
indicating the accumulated mass of the candidate peptide, is 
initialized to zero 648. During each iteration through a given 
candidate peptide 650 the mass of the amino acid at position I2 
is added to Pmass 652. It is next determined whether the mass 
thus-far accumulated (Pmass) equals the input mass (i.e., the 
mass of the peptide) 654. In some embodiments, this test may be 
performed as plus or minus a tolerance rather than requiring 
strict equality, as noted above. If there is equality 
(optionally within a tolerance) an analysis routine is started 
656 (detailed in Fig. 6D). Otherwise, it is determined 
whether Pmass is less than the input mass (optionally within a 
tolerance). If so, the index I2 is incremented 658 and the 
mass of the amino acid at the next position (the incremented 
I2 position) is added to Pmass 652. If Pmass is greater than 
the input mass (optionally by more than a tolerance 660) it is 
determined whether index I1 is at the end of a protein 662. 
If so, the search routine exits 664. Otherwise, index I1 is 
incremented 666 so that the routine can start with a new start 
position of a candidate peptide and the search procedure 
returns to block 648. 

When the analysis procedure is started 670 (Fig. 
6D), data indicative of b- and y- ions for the candidate 
peptide are generated 672, as described above. It is 
determined whether the peak is within the top 200 ions 674. 
The peak intensity is summed and the fragment match index is 
incremented 676. If previous b- or y- ions are matched 678, 
the β index is incremented 680. Otherwise, it is determined 
whether all fragment ions have been analyzed. If not, the 
fragment index is incremented 684 and the procedure returns to 
block 674. Otherwise, a preliminary score such as S_p, 
described above, is calculated 686. If the newly-calculated S_p 
is greater than the lowest score 688 the peptide sequence is 
stored 690 unless the sequence has already been stored, in 
which case the procedure exits 692. 
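
The start/end index search of Fig. 6C can be sketched as a pair of nested loops. The sketch assumes average residue masses, assumes that the input mass is the neutral peptide mass (so one water is added for the termini, a detail the flow chart does not spell out) and uses an illustrative tolerance of 1 mass unit. The names `candidate_peptides`, `AVERAGE_RESIDUE_MASS` and `WATER` are hypothetical.

```python
# Average residue masses (Da) of the twenty common amino acids.
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # assumption: Pmass is compared as a neutral peptide mass


def candidate_peptides(protein, input_mass, tolerance=1.0):
    """Return (I1, I2, sub-sequence) for every linear stretch of `protein`
    whose accumulated mass is within `tolerance` of `input_mass`."""
    hits = []
    for i1 in range(len(protein)):            # start position, index I1
        pmass = WATER
        for i2 in range(i1, len(protein)):    # end position, index I2
            pmass += AVERAGE_RESIDUE_MASS[protein[i2]]
            if abs(pmass - input_mass) <= tolerance:
                hits.append((i1, i2, protein[i1:i2 + 1]))
            elif pmass > input_mass + tolerance:
                break                         # too heavy: advance I1
    return hits


# Made-up protein string containing the heptapeptide of Example 3.
print(candidate_peptides("GGDLRSWTAKK", 847.93))
```

Because every residue adds mass, the inner loop can stop as soon as the accumulated mass overshoots the input mass, which is the exit taken at block 660.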

At the beginning of the correlation analysis (Fig. 
6E), a stored candidate peptide is selected 693. A 
theoretical spectrum for the candidate peptide is created 694, 
correlated with experimental data 695 and a final correlation 
score is obtained 696, as described above. The index is 
incremented 697 and the process repeated from block 693 unless 
all candidate peptides have been scored 698, in which case the 
correlation analysis procedure exits 699. 
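
The loop of Fig. 6E can be sketched as follows. The correlation function itself is defined earlier in the description and is not reproduced here; the sketch substitutes a plain normalized dot product (cosine similarity) as a stand-in, and the names `similarity` and `rank_candidates` are hypothetical.

```python
import math


def similarity(spec_a, spec_b):
    """Stand-in closeness-of-fit measure: cosine similarity of two spectra
    given as {unit mass: intensity} dictionaries."""
    dot = sum(i * spec_b.get(m, 0.0) for m, i in spec_a.items())
    norm_a = math.sqrt(sum(i * i for i in spec_a.values()))
    norm_b = math.sqrt(sum(i * i for i in spec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(experimental, theoretical_by_peptide):
    """Score every stored candidate against the experimental spectrum and
    return (score, peptide) pairs, best first."""
    scored = [(similarity(experimental, spec), peptide)
              for peptide, spec in theoretical_by_peptide.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


# Illustrative spectra only; peptide labels are placeholders.
experimental = {100: 50.0, 200: 50.0}
candidates = {
    "PEPTIDE_A": {100: 50.0, 200: 50.0},
    "PEPTIDE_B": {100: 50.0, 300: 50.0},
}
ranking = rank_candidates(experimental, candidates)
```

Any other closeness-of-fit function with the same signature could be dropped in place of `similarity` without changing the loop.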

The following examples are offered by way of 
illustration, not limitation. 

Experimental 
Example #1 

MHC complexes were isolated from HS-EBV cells 
transformed with HLA-DRB*0401 using antibody affinity 
chromatography. Bound peptides were released and isolated by 
filtration through a Centricon 10 spin column. The heavy 
chain of glycosylasparaginase from human leukocytes was isolated. 
Proteolytic digestions were performed by dissolving the 
protein in 50 mM ammonium bicarbonate containing 10 mM Ca2+, 
pH 8.6. Trypsin was added in a ratio of 100/1 protein/enzyme. 

Analysis of the resulting peptide mixtures was 
performed by LC-MS and LC-MS/MS. Briefly, molecular weights 
of peptides were recorded by scanning Q3 or Q1 at a rate of 
400 Da/sec over a mass range of 300 to 1600 throughout the 
HPLC gradient. Sequence analysis of peptides was performed 
during a second HPLC analysis by selecting the precursor ion 
with a 6 amu (FWHH) wide window in Q1 and passing the ions 
into a collision cell filled with argon to a pressure of 3-5 
mtorr. Collision energies were on the order of 20 to 50 eV. 
The fragment ions produced in Q2 were transmitted to Q3 and a 
mass range of 50 Da to the molecular weight of the precursor 
ion was scanned at 500 Da/sec to record the fragment ions. 
The low energy spectra of 36 peptides were recorded and stored 
on disk. The genpept database contains protein sequences 
translated from nucleotide sequences. A text search of the 
database was performed to determine if the sequences for the 
peptide amino acid sequences used in the analysis were present 
in the database. Subsequently, a second database was created 
from the whole database by appending amino acid sequences for 
peptides not included. 

The spectrum data was converted to a list of masses 
and intensities and the values for the precursor ion were 
removed from the file. The square root of all the intensity 
values was calculated and normalized to a maximum intensity of 
100.0. All ions except the 200 most intense ions were removed 
from the file. The remaining ions were divided into 10 mass 
regions and the maximum intensity normalized to 100.0 within 
each region. Each ion within 3.0 daltons of its neighbor on 
either side was given the greater intensity value if the 
neighboring intensity was greater than its own intensity. 
This processed data was stored for comparison to the candidate 
sequences chosen from the database search. The MS/MS spectrum 
was modified in a different manner for calculation of a 
correlation function. The precursor ion was removed from the 
spectrum and the spectrum divided into 10 equal sections. 
The ions in each section were then normalized to 50.0. This 
spectrum was used to calculate the correlation coefficient 
against a predicted MS/MS spectrum for each amino acid 
sequence retrieved from the database. 

Amino acid sequences from each protein were 
generated by summing the masses, using average masses for the 
amino acids, of the linear amino acid sequence from the amino 
terminus (n). If the mass of the linear sequence exceeded the 
mass of the unknown peptide, then the algorithm returned to 
the amino terminal amino acid and began summing amino acid 
masses from the n+1 position. This process was repeated until 
every linear amino acid sequence combination had been 
evaluated. When the mass of the amino acid sequence was 
within ± 0.05% (minimum of ± 1 Da) of the mass of the unknown 
peptide, the predicted m/z values for the type b- and y-ions 
were generated and compared to the fragment ions of the 
unknown sequence. A preliminary score (S_p) was calculated and 
the top 300 candidate peptide sequences with the highest 
preliminary score were ranked and stored. A final analysis of 
the top 300 candidate amino acid sequences was performed with 
a correlation function. Using this function a theoretical 
MS/MS spectrum for the candidate sequence was compared to the 
modified experimental MS/MS spectrum. Correlation 
coefficients were calculated, ranked and reported. The final 
results were ranked on the basis of the normalized correlation 
coefficient. 
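
The mass window and the b- and y-ion prediction used in this example can be sketched as below. The sketch assumes singly charged fragments, average residue masses and the usual definitions of the two series (a b-ion is the N-terminal residue sum plus a proton; a y-ion is the C-terminal residue sum plus water plus a proton). The function names are hypothetical.

```python
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528
PROTON = 1.00728


def mass_tolerance(peptide_mass):
    """± 0.05% of the peptide mass, with a minimum of ± 1 Da."""
    return max(1.0, 0.0005 * peptide_mass)


def fragment_ions(peptide):
    """Return (b, y) lists of singly charged m/z values.

    b[k] and y[k] are the two fragments produced by cleaving the same
    peptide bond, the one after residue k + 1.
    """
    total = sum(AVERAGE_RESIDUE_MASS[aa] for aa in peptide)
    b_ions, y_ions = [], []
    prefix = 0.0
    for aa in peptide[:-1]:
        prefix += AVERAGE_RESIDUE_MASS[aa]
        b_ions.append(prefix + PROTON)
        y_ions.append(total - prefix + WATER + PROTON)
    return b_ions, y_ions


b_ions, y_ions = fragment_ions("DLRSWTA")
```

For a peptide of n residues this yields n - 1 ions of each type, which are then compared against the processed experimental peak list.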

The spectrum shown in Fig. 5 was obtained by 
LC-MS/MS analysis of a peptide bound to a DRB*0401 MHC class 
II complex. A search of the genpept database containing 
74,938 protein sequences identified 384,398 peptides within a 
mass tolerance of ± 0.05% (minimum of ± 1 Da) of the molecular 
weight of this peptide. By comparing fragment ion patterns 
predicted for each of these amino acid sequences to the 
pre-processed MS/MS spectra and calculating a preliminary 
score, the number of candidate sequences was cut off at 300. A 
correlation analysis was then performed with the predicted 
MS/MS spectra for each of these sequences and the modified 
experimental MS/MS spectrum. The results of the search 
through the genpept database with the spectrum in Fig. 5 are 
displayed in Table 1. Two peptides of similar sequence, 
DLRSWTAADAAQISK [Seq. ID No. 1] and DLRSWTAADAAQISQ [Seq. ID No. 
2], were identified as the highest scoring sequences (C_n 
values). Their correlation coefficients are identical so 
their rankings in Table 1 are arbitrary. The amino acid 
sequence DLRSWTAADAAQISK [Seq. ID No. 1] occurs in five 
proteins in the genpept database while the sequence 
DLRSWTAADAAQISQ [Seq. ID No. 2] occurs in only one [...] 
three sequences appear in immunologically related proteins and 
the rest of the proteins appear to have no correlation to one 
another. A second search using the same MS/MS spectrum was 
performed with the H. sapiens subset of the database 
to compare the results. These data are presented in Table [...]. 
In both searches the correct sequence tied for the top 
position. Both amino acid sequences have identical 
correlation coefficients, C_n, although the sequences differ by 
Lys and Gln at the C-terminus. These two amino acids have the 
same nominal mass and would be expected to produce similar 
MS/MS spectra. The sum of the normalized fragment 
intensities for the matched fragment ions for the two [...] 
additional fragment ion in the preliminary [...] 
identifying 70% of the predicted fragment ions for this amino 
acid sequence in the pre-processed spectrum. These matches 
are defined as part of the preliminary scoring procedure. 

Example #2 

To examine the complexity of the mixture of 
peptides obtained by proteolysis of the total proteins from S. 
cerevisiae cells, 10^8 cells were grown and harvested. After 
lysis, the total proteins were contained in ~9 mL of solution. 
A 0.5 mL aliquot was removed for proteolysis with the enzyme 
trypsin. From this solution two microliters were directly 
injected onto a micro-LC (liquid chromatography) column for MS 
analysis. In a complex mixture of peptides it is conceivable 
that multiple peptide ions may exist at the same m/z and 
contribute to increased background, complicating MS/MS 
analysis and interpretation. To test the ability to obtain 
sequence information by MS/MS from these complex mixtures of 
peptides, ions from the mixture were selected with on-line 
MS/MS analysis. In no case were the spectra contaminated with 
fragment ions from other peptides. A partial list of the 
identified sequences is presented in Table 3. 

Table 3 

S. cerevisiae Protein              Seq. ID No.   Amino Acid Sequence 

enolase                             1            DPFAEDDWEAWSH 
hypusine containing protein hpz     4            APEGELGDSLQTAFDEGK 
phosphoglycerate kinase             5            TGGGASLELLEGK 
BMH1 gene product                   6            QAFDDAIAELDTLSEESYK 
pyruvate kinase                     7            IPAGWQGLDNGPSER 
phosphoglycerate kinase             8            LPGTDVDLPALSEK 
hexokinase                          9            IEDDPFENLEDTDDDFQK 
enolase                            10            EEALDLIVDAIK 
enolase                            11            NPTVEVELTTEK 


The MS/MS spectra presented in Table 1 were 
interpreted using the described database searching method. 
This method serves as a data pre-filter to match MS/MS spectra 
to previously determined amino acid sequences. Pre-filtering 
the data allows interpretation efforts to be focused on 
previously unknown amino acid sequences. Results for some of 
the MS/MS spectra are shown in Table 4. No pre-assigning of 
sequence ions or manual interpretation is required prior to 
the search. However, the sequences must exist in the 
database. The algorithm first pre-processed the MS/MS data 
and then compared all the amino acid sequences in the database 
within ±1 amu of the mass of the precursor ion of the MS/MS 
spectrum. The predicted fragmentation patterns of the amino 
acid sequences within the mass tolerance were compared to the 
experimental spectrum. Once an amino acid sequence was within 
this mass tolerance, a final closeness-of-fit measure was 
obtained by reconstructing the MS/MS spectra and performing a 
correlation analysis to the experimental spectrum. Table 4 
lists a number of spectra used to test the efficacy of the 
algorithm. 

The computer program described above has been 
modified to analyze the MS/MS spectra of phosphorylated 
peptides. In this algorithm all types of phosphorylation are 
considered such as Thr, Ser, and Tyr. Amino acid sequences 
are scanned in the database to find linear stretches of 
sequence that are multiples of 80 amu below the mass of the 
peptide under analysis. In the analysis each putative site of 
phosphorylation is considered and attempts to fit a 
reconstructed MS/MS spectrum to the experimental spectrum are 
made. 
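
The phosphopeptide variant can be sketched as below. The sketch assumes the nominal 80 amu shift stated above, an illustrative tolerance of 1 mass unit and an arbitrary upper limit of three phosphate groups; the function names are hypothetical.

```python
from itertools import combinations

PHOSPHATE_SHIFT = 80.0  # nominal mass added per phosphorylation, per the text


def phosphate_count(unmodified_mass, observed_mass, tolerance=1.0, max_groups=3):
    """Return k if the stretch is k multiples of 80 amu below the observed
    peptide mass (within tolerance), or None if no multiple fits."""
    for k in range(1, max_groups + 1):
        if abs(unmodified_mass + k * PHOSPHATE_SHIFT - observed_mass) <= tolerance:
            return k
    return None


def putative_site_sets(peptide, k):
    """Enumerate every way of placing k phosphates on Ser, Thr or Tyr,
    as tuples of zero-based residue positions."""
    sites = [pos for pos, aa in enumerate(peptide) if aa in "STY"]
    return list(combinations(sites, k))


# Illustrative masses and sequence, not taken from Table 4.
k = phosphate_count(1000.0, 1160.0)
site_sets = putative_site_sets("ASTKY", k)
```

Each tuple of positions would then be used to build one reconstructed spectrum, with the shift added to every fragment that contains a modified residue.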
Table 4 

List of results obtained searching genpept and 
species specific databases using MS/MS spectra for the 
respective peptides. 

No.  Mass     Seq. ID No.  Amino Acid Sequence of Peptides used in the Search 

 1   1?34.9   [...]        DLRSWTAADTAAQI[...] 
 2   1749     13           DLRSWTAADTAAQITQ 
 3   1186.5   14           MATPLLMQALP 
 4   1317.7   14           MATPLLMQALP 
 5   1571.6   15           EGVNDNEEGFFSAR [...] 
 6   1571.6   15           EGVNDNEEGFFSAR 
 7   1297.5   16           DRVYIHPFHL (+2) 
 8   1297.5   16           DRVYIHPFHL (+2) 
 9   1297.5   16           DRVYIHPFHL (+3) 
10   1593.8   17           VEADVAGHGQDILIR 
11   1393.7   18           HGVTVLTALGAILK 
12   1741.8   19           HSGQAEGYSYTDANIK 
13    848.8   20           HSGQAEGY (+1) [...] 
14    723.9   21           MAFGGLK (+1) 
15    636.8   22           GATLFK (+1) [QATLFG, KTLFK] 
16    524.6   23           TEFK (+1) [...] 
17   1251.4   24           DRNDLLTYLK [...] 
18   1194.4   25           VLVLDTDYKK [...] 
19    700.7   26           CRGDSY 1 (CGRDSY) 
20    700.7   26           CRGDSY (+1) 
21    764.9   27           KGATLFK 
22   1169.3   28           TGPNLHGLFGR 
23   1047.2   29           DRVYIHPF 
24   1139.3   30           TLLVGESATTF (+1) 
25   1189.4   31           RNVIPDSKY 
26    613.7   32           SSPLPL (+1) 
27   1323.5   33           LARNCQPNYW (C = 161.17) 
28   2496.7   34           AQSMGFINEDLSTSAQALMSDW 
29   1551.8   35           VTLIHPIAMDDGLR 
30   1803.0   36           GGDTVTLNETDLTQIPK 
31   1172.4   37           VGEEVEIVGIK 
32   2148.5   38           GWQVPAFTLGGEATDIWMR 
33   2553.9   39           VASISLPTSCASAGTQCLISGWGNTK 
34   1154.3   40           SSGTSYPDVLK 
35   1174.5   41           TLNNDIMLIK 
36   2274.6   42           SIVHPSYNSNTLNNDIMLTK 

not present in the genpept database 
sequence appended to the human database 



Genpept      Genpept      Species 
Database     Database     specific 



l 
1 

61 
1* 
1* 

1 
2 
1 
1 
1 
1 
1 



1 

5* 

6 



3 
1 

1 
1 
2 
1 
1 
3 
2 
1 
1 



1 
1 

61 

1 

1 

1 

2 

1 

1 

1 

1 

1 



1 
5 
6 
1 

3 
1 

1 

1 

4 

1 

1 

3 

2 

1 

1 

1 

3 

1 

2 



1 
1 

13 

17 

1 

1 

1 

2 

1 

1 

1 

1 

1 

6 

5 

1 

2 

1 

7 

1 

1 

7 

1 

1 

2 

1 

1 

1 

1 

1 

1 

■1 

1 

1 

1 



not originally in human 

3 amino acid sequences added to database 

(-) not in the top 100 answers 

* peptide of similar sequence identified 



Example #3 

Much of the information generated by the genome 
projects will be in the form of nucleotide sequences. Those 
stretches of nucleotide sequence that can be correlated to a 
gene will be translated to a protein sequence and stored in a 
specific database (genpept). The un-translated nucleotide 
sequences represent a wealth of data that may be relevant to 
protein sequences. The present invention will allow searching 
the nucleotide database in the same manner as the protein 
sequence databases. The procedure will involve the same 
algorithmic approach of cycling through the nucleotide 
sequence. The three-base codon will be converted to a protein 
sequence and the mass of the amino acids summed. To cycle 
through the nucleotide sequence, a one-base increment will be 
used for each cycle. This will allow the determination of an 
amino acid sequence for each of the three reading frames in 
one pass. For example, if an MS/MS spectrum is generated for the 
sequence Asp-Leu-Arg-Ser-Trp-Thr-Ala [Seq. ID No. 43] 
((M+H)+ = 848), the algorithm will search the nucleotide sequence 
in the following manner. 

Nucleotide sequence from the database. 
nucleotides   GCG AUC UCC GGU CUU GGA CUG CUC 

First pass through the sequence. 
nucleotides   GCG AUC UCC GGU CUU GGA CUG CUC      [Seq. ID No. 44] 
amino acids   Ala Ile Ser Gly Leu Gly Leu Leu      Mass 743 

Second pass through the sequence. 
nucleotides   G CGA UCU CCG GUC UUG GAC UGC UC     [Seq. ID No. 44] 
amino acids   Arg Ser Pro Val Leu Gly Leu          Mass 741 

Third pass through the sequence. 
nucleotides   GC GAU CUC CGG UCU UGG ACU GCU C     [Seq. ID No. 44] 
amino acids   Asp Leu Arg Ser Trp Thr Ala          Mass 848 

Fourth pass through the sequence. 
nucleotides   GCG AUC UCC GGU CUU GGA CUG CUC      [Seq. ID No. 44] 
amino acids   Ile Ser Gly Leu Gly Leu Leu          Mass 672   [Seq. ID No. 45] 

When the mass of a sequence of amino acids matches the mass of 
the peptide, the predicted sequence ions will be compared to the 
MS/MS spectrum. From this point on the scoring and reporting 
procedures for the search will be the same as for a protein 
sequence database. 
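
The one-base-increment translation can be sketched as below, using the standard genetic code and average residue masses. The function names are hypothetical, and the reported masses are taken to be neutral peptide masses (residue sum plus one water), which is an assumption inferred from the values listed above.

```python
BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528


def translate(rna, offset):
    """Translate complete codons of `rna` starting at base `offset`."""
    return "".join(CODON_TABLE[rna[p:p + 3]]
                   for p in range(offset, len(rna) - 2, 3))


def three_frames(rna):
    """One-base increments give the three forward reading frames."""
    return [translate(rna, offset) for offset in range(3)]


def peptide_mass(peptide):
    """Neutral average mass: residue sum plus one water."""
    return sum(AVERAGE_RESIDUE_MASS[aa] for aa in peptide) + WATER


frames = three_frames("GCGAUCUCCGGUCUUGGACUGCUC")
for frame in frames:
    print(frame, round(peptide_mass(frame)))
```

With the standard code the first and third frames reproduce the residues listed for the first and third passes. For the second frame the last two codons, GAC and UGC, translate as Asp and Cys rather than the Gly and Leu printed in the listing.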

In light of the above description, a number of 
advantages of the present invention can be seen. The present 
invention permits correlating mass spectra of a protein, 
peptide or oligonucleotide with a nucleotide or protein 
sequence database in a fashion which is relatively accurate, 
rapid, and which is amenable to automation (i.e., to operation 
without the need for the exercise of human judgment). The 
present invention can be used to analyze peptides which are 
derived from a mixture of proteins and thus is not limited to 
analysis of intact homogeneous proteins such as those 
generated by specific and known proteolytic cleavage. 

A number of variations and modifications of this 
invention can also be used. The invention can be used in 
connection with a number of different proteins or peptide 
sources and it is believed applicable to any analysis using 
mass spectrometry and proteins. In addition to the examples 
described above, the present invention can be used, for 
example, for monitoring fermentation processes by collecting 
cells, lysing the cells to obtain the proteins, digesting the 
proteins, e.g., in an enzyme reactor, and analyzing by mass 
spectrometry as noted above. In this example, the data could 
be interpreted using a search of the organism's database 
(e.g., a yeast database). As another example, the invention 
could be used to determine the species of organism from which 
a protein is obtained. The analysis would use a set of 
peptides derived from digestion of the total proteins. Thus, 
the cells from the organism would be lysed, the proteins 
collected and digested. Mass spectrometry data would be 
collected with the most abundant peptides. A collection of 
spectra (e.g., 5 to 10 spectra) would be used to search the 
entire database. The spectra should match known proteins of 
the species. Since this method would use the most abundant 
proteins in the cell, it is believed that there is a high 
likelihood that these proteins would already be 
sequenced and in the database. In one embodiment, relatively 
few cells could be used for the analysis (e.g., on the order 
of 10^4 to 10^5). 

For example, methods of the invention can be used to 
identify microorganisms, cell surface proteins and the like. 
For identifying microorganisms, the procedure can employ 
tandem mass spectra obtained from peptides produced by 
proteolytic digestion of the cellular proteins. The complex 
mixture of peptides produced is subjected to separation by 
HPLC on-line to a tandem mass spectrometer. As peptides elute 
off the column, tandem mass spectra are obtained by selecting a 
peptide ion in the first mass analyzer, sending it into a 
collision cell, and recording the mass-to-charge (m/z) ratios 
of the resulting fragment ions in the second mass analyzer. 
This process is performed over the course of the HPLC analysis 
and produces a large collection of spectra (e.g., from 10 to 
200 or more). Each spectrum represents a peptide derived from the 
microorganism's protein (gene) pool and thus the collection 
can be used to develop one or more family, genus, species, 
serotype or strain-specific markers of the microorganism, as 
desired. 

The identification of the microorganism is performed 
using one of at least three software related techniques. In a 
first technique, a database search, the tandem mass spectra 
are used to search protein and nucleotide databases to 
identify an amino acid sequence which is represented by the 
spectrum. Identification of the organism is achieved when a 
preponderance of spectra obtained in the mass spectrometry 
analysis match to proteins previously identified as coming 
from a particular organism. Means for searching databases in 
this fashion are as described hereinabove. 
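
The "preponderance of spectra" criterion of the first technique can be sketched as a simple vote. The description does not quantify "preponderance"; the sketch assumes a simple majority of all spectra, and the name `identify_organism` and the `min_fraction` parameter are hypothetical.

```python
from collections import Counter


def identify_organism(best_matches, min_fraction=0.5):
    """`best_matches` holds, for each spectrum, the source organism of its
    top-scoring database match, or None when nothing matched.

    Returns the organism accounting for more than `min_fraction` of all
    spectra, or None when no organism reaches that share.
    """
    counts = Counter(org for org in best_matches if org is not None)
    if not counts:
        return None
    organism, n = counts.most_common(1)[0]
    return organism if n / len(best_matches) > min_fraction else None


# Illustrative per-spectrum results.
calls = ["E. coli", "E. coli", "S. cerevisiae", None, "E. coli"]
print(identify_organism(calls))
```

Returning None when no organism dominates gives a natural point at which to fall back to another technique.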

In a second technique a library search can be 
performed, such as if no solid matches are observed using the 
database search described above. In this approach the data 
set is compared to a pre-defined library of spectra obtained 
from known organisms. Thus, initially a library of peptide 
spectra is created from known microorganisms. The library of 
tandem mass spectra for micro-organisms can be constructed by 
any of several methods which employ LC-MS/MS. The methods can 
vary in the location from which cellular proteins are obtained, 
and in the amount of pre-purification employed for the 
resulting peptide mixture prior to LC-MS/MS analysis. For 
example, intact cells can be treated with a proteolytic enzyme 
such as trypsin, chymotrypsin, endoproteinase Glu-C, 
endoproteinase Lys-C, pepsin, etc. to digest the proteins 
exposed on the cell surface. Pre-treatment of the intact 
cells with one or more glycosidases can be used to remove 
steric interference that may be created by the presence of 
carbohydrates on the cell surface. Thus, the pre-treatment 
with glycosidases may be used to obtain higher peptide yields 
during the proteolysis step. A second method to prepare 
peptides involves rupturing the cell membranes (e.g., by 
sonication, hypo-osmotic shock, freeze-thawing, glass beads, 
etc.) and collecting the total proteins by precipitation, 
e.g., using acetone or the like. The proteins are resuspended 
in a digestion buffer and treated with a protease such as 
trypsin, chymotrypsin, endoproteinase Glu-C, endoproteinase 
Lys-C, etc. to create a mixture of peptides. Partial 
simplification of this mixture of peptides, such as by 
partitioning the mixture into acid and basic fractions or by 
separation using strong cation exchange chromatography, leads 
to several pools of peptides which can then be used in the 
mass spectrometry process. The peptide mixtures are analyzed 
by LC-MS/MS, creating a large set of spectra, each 
representing a unique peptide marker of the organism or cell 
type. 

The data are stored in the library in any of a 
variety of means, but conveniently in three sections, wherein 
one section is the peptide mass determined from the spectrum, 
a second section is information related to the organism, 
species, growth conditions, etc., and a third section contains 
the mass/intensity data. The data can be stored in a variety 
of formats, conveniently an ASCII format or in a binary 
format. To perform the library search, spectra are compared 
by first determining whether the mass of the peptide is within 
a preset mass tolerance (typically about ± 1-3 amu) of the 
library spectrum; a cross-correlation function as described 
hereinabove is then used to obtain a quantitative value of the 
similarity or closeness-of-fit of the two spectra. The 
process is similar to the database searching algorithm except 
a spectrum is not reconstructed for the amino acid sequence. 



WO 95/25281                                   PCT/US95/03239 



To provide a set of comparison spectra the tandem mass 
spectrum can be used to search a small (e.g., ~100 protein 
sequences) randomly generated sequence database. This 
provides a background against which similarity is compared and 
from which a normalized score is generated. 
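
A library search of this kind can be sketched as below. Each library entry carries the three sections described above (peptide mass, organism information, mass/intensity data). The cross-correlation function is defined earlier in the description; the sketch substitutes a cosine similarity as a stand-in, and all names and the 2 amu default tolerance are assumptions.

```python
import math


def similarity(spec_a, spec_b):
    """Stand-in closeness-of-fit measure (cosine similarity)."""
    dot = sum(i * spec_b.get(m, 0.0) for m, i in spec_a.items())
    norm_a = math.sqrt(sum(i * i for i in spec_a.values()))
    norm_b = math.sqrt(sum(i * i for i in spec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def library_search(query_mass, query_spectrum, library, tolerance=2.0):
    """Compare the query only against library spectra whose peptide mass is
    within `tolerance`; return (score, organism info) pairs, best first."""
    hits = [
        (similarity(query_spectrum, entry["spectrum"]), entry["info"])
        for entry in library
        if abs(entry["mass"] - query_mass) <= tolerance
    ]
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return hits


# Illustrative library with placeholder organism labels.
library = [
    {"mass": 1000.2, "info": "organism A", "spectrum": {100: 50.0, 200: 50.0}},
    {"mass": 1001.0, "info": "organism B", "spectrum": {100: 50.0, 300: 50.0}},
    {"mass": 1500.0, "info": "organism C", "spectrum": {100: 50.0, 200: 50.0}},
]
hits = library_search(1000.0, {100: 50.0, 200: 50.0}, library)
```

The mass test acts as a cheap filter, so the more expensive spectrum comparison runs only on the few library entries that could plausibly be the same peptide.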

A third related technique for identifying a 
microorganism or cell involves de novo interpretation to 
determine a set of amino acid sequences that have the same 
mass as the peptide represented by the spectrum. The set of 
amino acid sequences is limited by using the spectral pre- 
processing equation 1, above, to rank the sequences. This set 
of amino acid sequences then serves as the database for use in 
the searching method described hereinabove. An amino acid 
sequence is thereby derived for a tandem mass spectrum that is 
not contained in the organized databases. By using 
phylogenetic analysis of the determined amino acid sequences 
they can be placed within a species, genus or family and a 
classification of the microorganism is thereby accomplished. 

The methodology described above has applications in 
addition to identifying microorganisms. For example, cDNA 
sequencing can be carried out using conventional means to 
obtain partial sequences of genes expressed in particular cell 
lines, tissue types or microorganisms. This information then 
serves as the database for the subsequent analyses. The 
approach described above for digesting proteins exposed on 
the cell surface by enzymatic digestion can be used to 
generate a collection of peptides for LC-MS/MS analysis. The 
resulting spectra are used to search the nucleotide sequences 
in all 6 reading frames to match amino acid sequences to the 
MS/MS spectra. The amino acid sequences identified represent 
regions of the cell surface proteins exposed to the 
extracellular space. This method provides at least two 
additional pieces of information not directly obtainable from 
cDNA sequencing. First, the spectra identify the proteins 
residing on the membrane of the cells. Secondly, sidedness 
information is obtained about the folding of the proteins on 
the cell surface. The peptide sequences matched to the 
nucleotide sequence information identify those segments of 
the protein sequence exposed extracellularly. 

The methods can also be used to interpret the MS/MS 
spectra of carbohydrates. In this method the carbohydrate(s) 
of interest is subjected to separation by HPLC on-line to a 
tandem mass spectrometer as with the peptides. The 
carbohydrates can be obtained from a complex mixture of 
carbohydrates or obtained from proteins, cells, etc. by 
chemical or enzymatic release. Tandem mass spectra are 
obtained by selecting a carbohydrate ion in the first mass 
analyzer, sending it into a collision cell, and recording the 
mass-to-charge (m/z) ratios of the resulting fragment ions in 
the second mass analyzer. This process is performed over the 
course of the HPLC analysis and produces a large collection of 
spectra (e.g., from 10 to 200 or more). The fragmentation 
patterns of the carbohydrate structures contained in the 
database can be predicted and a theoretical representation of 
the spectra can be compared to the pattern in the tandem mass 
spectrum by using the method described hereinabove. The 
carbohydrate structures analyzed by tandem mass spectrometry 
can thereby be identified. These methods can thus be used for 
characterization of the carbohydrate structures found on 
proteins, cell [...] 

[...] be used in connection with 
diagnostic applications, such as described above and in 
Example 2. Another example involves identifying virally 
infected cells. Success of such an approach is believed to 
depend on the relative abundance of the viral proteins versus 
the cellular proteins, at least using present equipment. If 
an antibody were produced to a specific region of a protein 
common to certain pathogens, the mixture of proteins could be 
partially fractionated by passing the material over an 
immunoaffinity column. Bound proteins are [...] 
digested. Mass spectrometry generates the data to [...] 
database. One important element is finding a general handle 
to pull proteins from the cell. This approach could also be 
used to analyze specific diagnostic proteins. For example, if 
a certain protein variant is known to be present in some form 
of cancer or genetic disease, an antibody could be produced to 
a region of the protein that does not change. An 
immunoaffinity column could be constructed with the antibody 
to isolate the protein away from all the other cellular 
proteins. The protein would be digested and analyzed by 
tandem mass spectrometry. The database of all the possible 
mutations in the protein could be maintained and the 
experimental data analyzed against this database. 

One possible example would be cystic fibrosis. This 
disease is characterized by multiple mutations in the CFTR 
protein. One mutation is responsible for about 70% of the 
cases and the other 30% of the cases result from a wide 
variety of mutations. To analyze these mutations by genetic 
testing would require many different analyses and probes. In 
the assay described above, the protein would be isolated and 
analyzed by tandem mass spectrometry. All the mutations in 
the protein could be identified in an assay based on 
structural information. The database used would preferably 
describe all the known mutations. Implementation of this 
approach is believed to involve significant difficulties. The 
amount of protein required could be so large that it would be 
impractical to obtain from a patient. This problem may be 
overcome as the sensitivity of mass spectrometry improves in 
the future. CFTR is a transmembrane protein, and such 
proteins are typically very difficult to manipulate and 
digest. The approach described could be used for any 
diagnostic protein. The data would be highly specific and the 
data analysis essentially automated. 

It is believed that the present invention can be 
used with any size peptide. The process requires that 
peptides be fragmented and there are methods for achieving 
fragmentation of very large proteins. Some such techniques 
are described in Smith et al., "Collisional Activation and 
Collision-Activated Dissociation of Large Multiply Charged 
Polypeptides and Proteins Produced by Electrospray Ionization," 
J. Amer. Soc. Mass Spect. 1: 53-65 (1990). The present method 
can be used to analyze data derived from intact proteins, in 
that there is no theoretical or absolute practical limit to 
the size of peptides that can be analyzed according to this 
invention. Analysis using the present invention has been 
performed on peptides at least in the size range from about 
400 amu (4 residues) to about 2500 amu (26 residues). 

In the described embodiments, candidate sub-sequences are 
identified and fragment spectra are predicted as they are 
needed, at the time of doing the analysis. If sufficient 
computational resources and storage facilities are available 
to perform some or all of the calculations needed for 
candidate sequence identification (such as calculation of 
sub-sequence masses) and/or spectra prediction (such as 
calculation of fragment masses), storage of these items in a 
database can be employed so that some or all of these items 
can be looked up rather than calculated each time they are 
needed. 
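
The lookup strategy described above can be sketched as follows. This is a minimal illustration only, assuming monoisotopic residue masses, singly charged b- and y-type fragment ions, and Python's `functools.lru_cache` as a stand-in for the database; the function names are hypothetical, and the description does not prescribe any particular storage mechanism.

```python
from functools import lru_cache

# Monoisotopic residue masses (Da) for a few residues; a real table would
# cover all amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "K": 128.09496,
}
WATER = 18.01056
PROTON = 1.00728


@lru_cache(maxsize=None)
def subsequence_mass(seq):
    """Neutral mass of a candidate sub-sequence; cached after first use."""
    return sum(RESIDUE_MASS[r] for r in seq) + WATER


@lru_cache(maxsize=None)
def predicted_fragments(seq):
    """Singly charged b- and y-ion m/z values for seq, as a sorted tuple."""
    b_ions, y_ions = [], []
    running = 0.0
    for residue in seq[:-1]:            # b1 .. b(n-1), from the N-terminus
        running += RESIDUE_MASS[residue]
        b_ions.append(running + PROTON)
    running = 0.0
    for residue in reversed(seq[1:]):   # y1 .. y(n-1), from the C-terminus
        running += RESIDUE_MASS[residue]
        y_ions.append(running + WATER + PROTON)
    return tuple(sorted(b_ions + y_ions))
```

The first call for a given sub-sequence performs the calculation; repeated calls for the same sub-sequence are served from the cache. A persistent database keyed on the sub-sequence would play the same role across separate analyses.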

While the present invention has been described by 
way of the preferred embodiment and certain variations and 
modifications, other variations and modifications of the 
present invention can also be used, the invention being 
described by the following claims. 

WHAT IS CLAIMED IS: 

1. A method for correlating a peptide fragment 
mass spectrum with amino acid sequences derived from a 
database of sequences, comprising: 

storing data representing a first mass spectrum of a 
plurality of fragments of at least a first peptide; 

calculating a plurality of predicted mass spectra of 
at least a portion of a plurality of said sequences in said 
database of sequences; and 

calculating at least a first measure for each of 
said plurality of predicted mass spectra, said first measure 
being an indication of the closeness-of-fit between said first 
mass spectrum and each of said plurality of mass spectra. 

2. A method, as claimed in claim 1, wherein said 
first mass spectrum is provided from a tandem mass 
spectrometer device. 

3. A method, as claimed in claim 2, wherein the 
tandem mass spectrometer is one of a triple quadrupole mass 
spectrometer, a Fourier-transform cyclotron resonance mass 
spectrometer, a tandem time-of-flight mass spectrometer and a 
quadrupole ion trap mass spectrometer. 

4. A method, as claimed in claim 1, wherein said 
database of sequences is a database of amino acid sequences of 
a plurality of proteins. 

5. A method, as claimed in claim 1, wherein said 
database of sequences is a nucleotide database. 



6. A method, as claimed in claim 1, further 
comprising selecting a first plurality of sub-sequences from 
said database of sequences, wherein said step of calculating a 
plurality of predicted mass spectra includes calculating at 
least one predicted mass spectrum for each of said selected 
first plurality of sub-sequences. 

7. A method, as claimed in claim 1, wherein said 
step of calculating a first measure includes selecting those 
values from said first mass spectrum having an intensity 
greater than a predetermined threshold. 

8. A method, as claimed in claim 1, further 
comprising normalizing said first spectrum prior to said step 
of calculating at least a first measure. 

9. A method, as claimed in claim 1, wherein said 
step of calculating a plurality of predicted mass spectra 
includes calculating predicted mass spectra for only a portion 
of said sequence database. 

10. A method, as claimed in claim 9, wherein said 
first peptide is derived from a protein which is obtained from 
a first organism and wherein said portion of said sequence 
database is the portion containing sequences for proteins 
found in said first organism. 

11. A method, as claimed in claim 2, wherein a first 
mass spectrometer of said tandem mass spectrometer device is 
used to separate out a component having a first mass, an 
activation device of said mass spectrometer device is used to 
fragment said first component and a second mass spectrometer 
of said tandem mass spectrometer device is used to provide said 
first mass spectrum. 

12. A method, as claimed in claim 1, wherein said 
first peptide is isolated by chromatography. 

13. A method, as claimed in claim 1, wherein said 
data representing said first mass spectrum includes a 
plurality of mass-charge pairs. 

14. A method, as claimed in claim 1, wherein said 
step of calculating predicted mass spectra comprises: 

deriving a plurality of masses from portions of said 
plurality of sequences, each mass equal to the mass of a 
peptide fragment which corresponds to a portion of a sequence 
in said plurality of sequences; 

selecting those masses, among said plurality of 
masses, which are within a predetermined mass tolerance of the 
mass of said first peptide and storing an indication of which 
portion of which sequence each of said selected masses 
corresponds to, to provide a plurality of candidate sequence 
portions; and 

calculating a plurality of mass-charge pairs for 
each of said candidate sequence portions, each of said 
mass-charge pairs having a mass substantially equal to the mass of 
a peptide fragment corresponding to a portion of one of said 
candidate sequence portions. 

15. A method, as claimed in claim 1, wherein said 
first measure comprises a correlation coefficient. 



16. A method, as claimed in claim 1, wherein said 
step of calculating a first measure comprises: 

calculating a preliminary score for each of said 
plurality of candidate sequence portions; 

identifying a plurality of primary candidate 
portions which have a preliminary score which is greater than 
at least one candidate sequence which is not identified as a 
primary candidate portion; and 

calculating a correlation coefficient for each of 
said primary candidate portions. 



17. A method, as claimed in claim 8, wherein each 
of said plurality of mass spectra and said first mass spectrum 
includes a plurality of mass-charge pairs, each mass-charge 
pair having an intensity value, and further comprising the 
step of identifying, for each of said plurality of mass 
spectra, a set of matched fragments which have less than a 
predetermined difference from corresponding fragments in said 
first mass spectrum; and 

wherein said preliminary score is the number of 
fragments of a predicted mass spectrum in said set of matched 
fragments multiplied by the sum of the intensity values for 
the mass-charge pairs corresponding to said matched fragments. 

18. A method for interpreting the mass spectrum of 
an oligonucleotide comprising: 

providing a library of nucleotide sequences; 

storing, in a database, a plurality of nucleotide 
sub-sequences from said library, said plurality including all 
sequences smaller than n-mers; 

storing data representing a first mass spectrum of a 
plurality of fragments of said oligonucleotide; 

calculating predicted mass spectra for each of said 
plurality of nucleotide sub-sequences; and 

calculating at least a first closeness-of-fit 
measure for each of said predicted mass spectra, with respect 
to said first mass spectrum. 



19. A method, as claimed in claim 18, wherein n is 
10. 



20. A method for determining whether a peptide in a 
mixture of proteins is homologous to a portion of any of a 
plurality of proteins specified by an amino acid sequence in a 
database of sequences, comprising: 

using a tandem mass spectrometer to receive a 
plurality of peptides obtained from said mixture of proteins, 
to select at least a first peptide from said mixture of 
peptides, to fragment said first peptide and to generate a 
peptide fragment mass spectrum; 

storing data representing said first mass spectrum; 
and 

correlating said mass spectrum with an amino acid 
sequence in said database of sequences, to determine the 
correspondence of a protein specified in said sequence 
database with a protein in said mixture of proteins. 

21. A method, as claimed in claim 20, wherein said 
step of correlating includes predicting at least one mass 
spectrum from said amino acid sequence. 

22. A method according to claim 20, wherein the 
mixture of proteins is obtained from a cell or microorganism 
to be identified. 

23. A method according to claim 22, wherein the 
mixture of proteins is obtained by proteolytic digestion of 
cellular proteins. 

24. The method of claim 23, wherein the cellular 
proteins are extracellular. 

25. A method for identifying an organism of 
interest by determining whether a mass spectrum or a plurality 
of mass spectra of peptides obtained from the organism or 
components thereof to be identified is contained in a library 
of spectra of known organisms, comprising: 

using a tandem mass spectrometer to receive a 
plurality of peptides obtained from a mixture of proteins 
obtained from the organism to be identified, to select at 
least a first peptide from said plurality of peptides, to 
fragment said first peptide and to generate a peptide fragment 
mass spectrum; 

storing data representing said first mass spectrum; 

correlating said mass spectrum with a mass spectrum 
in said library of spectra of known organisms to determine the 
correspondence of said spectra with the spectra obtained from 
peptides obtained from the organism to be identified, thereby 
identifying said organism. 

26. The method of claim 25, wherein the organism to 
be identified is a bacterium, fungus or virus. 

27. The method according to claim 25, wherein the 
mixture of proteins is obtained by enzymatic digestion of the 
organism's proteins. 

28. A method for characterizing a carbohydrate 
structure of interest from a mixture of carbohydrates, 
comprising: 

using a tandem mass spectrometer to receive a 
[...] 
the tandem mass spectrometer, to fragment said first 
carbohydrate and to generate a carbohydrate fragment mass 
spectrum; 

storing data representing said first mass spectrum; and 

correlating said mass spectrum with a database of 
spectra of known carbohydrates, to determine the 
correspondence of a carbohydrate specified in said 
carbohydrate database with a carbohydrate in said mixture of 
carbohydrates, thereby characterizing the structure of the 
carbohydrate of interest. 

29. The method of claim 28, wherein the mixture of 
carbohydrates is obtained from a glycosylated protein of 
interest. 

30. The method of claim 29, wherein the mixture of 
carbohydrates is obtained from a glycosylated protein of 
[...] or enzymatic release from the protein. 
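
By way of illustration only, the preliminary score recited in claim 17 (the number of matched fragments multiplied by the sum of the intensities of the matched mass-charge pairs) can be sketched in Python as follows. The function names, the 0.5 tolerance, the normalization to a base peak of 100, and the rule of taking the most intense observed peak within tolerance of each predicted fragment are assumptions made for this sketch and are not specified by the claims.

```python
def normalize(spectrum):
    """Scale intensities so the largest peak is 100 (one possible normalization)."""
    top = max(intensity for _, intensity in spectrum)
    return [(mz, 100.0 * intensity / top) for mz, intensity in spectrum]


def preliminary_score(observed, predicted_mz, tolerance=0.5):
    """Number of matched predicted fragments times the summed matched intensity.

    observed     -- list of (m/z, intensity) pairs from the measured spectrum
    predicted_mz -- iterable of predicted fragment m/z values for one candidate
    tolerance    -- maximum allowed m/z difference for a match (assumed value)
    """
    matched = 0
    intensity_sum = 0.0
    for mz in predicted_mz:
        hits = [i for obs_mz, i in observed if abs(obs_mz - mz) < tolerance]
        if hits:
            matched += 1
            intensity_sum += max(hits)  # assumption: use the strongest peak
    return matched * intensity_sum


# Toy data: three observed peaks and two hypothetical candidates.
observed = normalize([(100.0, 50.0), (200.0, 100.0), (300.0, 25.0)])
candidates = {
    "candidate_a": [100.2, 200.1, 400.0],
    "candidate_b": [300.3, 500.0],
}
scores = {name: preliminary_score(observed, mz) for name, mz in candidates.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy case `candidate_a` matches two peaks and ranks ahead of `candidate_b`, which matches one; under claim 16, only the higher-ranked candidates would proceed to the correlation coefficient calculation.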